Lecture 20
bench
library(tidyverse)
d = tibble(
x = runif(10000),
y = runif(10000)
)
(b = bench::mark(
d[d$x > 0.5, ],
d[which(d$x > 0.5), ],
subset(d, x > 0.5),
filter(d, x > 0.5)
))
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 172.36µs 184.01µs 5392. 253.49KB 14.9
2 d[which(d$x > 0.5), ] 131.61µs 145.2µs 6832. 274.31KB 38.7
3 subset(d, x > 0.5) 241.37µs 257.11µs 3862. 289.55KB 21.5
4 filter(d, x > 0.5) 1.39ms 1.43ms 693. 2.06MB 17.2
d = tibble(
x = runif(1e6),
y = runif(1e6)
)
(b = bench::mark(
d[d$x > 0.5, ],
d[which(d$x > 0.5), ],
subset(d, x > 0.5),
filter(d, x > 0.5)
))
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 16.9ms 17.4ms 57.0 13.4MB 49.4
2 d[which(d$x > 0.5), ] 13.3ms 13.5ms 73.7 24.8MB 156.
3 subset(d, x > 0.5) 23.4ms 23.5ms 42.5 24.8MB 92.1
4 filter(d, x > 0.5) 17.6ms 18.1ms 55.6 24.8MB 94.5
bench - relative results
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 d[d$x > 0.5, ] 1.28 1.29 1.34 1 1
2 d[which(d$x > 0.5), ] 1 1 1.73 1.86 3.15
3 subset(d, x > 0.5) 1.76 1.74 1 1.86 1.86
4 filter(d, x > 0.5) 1.32 1.34 1.31 1.86 1.91
parallel
Part of the base packages in R
Tools for the forking of R processes (some functions do not work on Windows)
Core functions:
detectCores
pvec
mclapply
mcparallel & mccollect
detectCores
Surprisingly, detects the number of cores of the current system.
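A quick check on the current machine:

```r
library(parallel)

detectCores()                  # number of logical cores
detectCores(logical = FALSE)   # physical cores only (NA on some platforms)
```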
pvec
Parallelization of a vectorized function call
bench::system_time - an alternative to base system.time
cores = c(1,4,8,16)
order = 6:8
f = function(x,y) {
system.time(
pvec(1:(10^y), sqrt, mc.cores = x)
)[3]
}
res = map(
cores,
function(x) {
map_dbl(order, f, x = x)
}
) %>%
do.call(rbind, .)
rownames(res) = paste0(cores," cores")
colnames(res) = paste0("10^",order)
res
## 10^6 10^7 10^8
## 1 cores 0.011 0.133 1.662
## 4 cores 0.059 0.360 4.664
## 8 cores 0.072 0.372 3.971
## 16 cores 0.098 0.432 3.979
mclapply
Parallelized version of lapply
system.time(rnorm(1e6))
## user system elapsed
## 0.101 0.007 0.107
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))
## user system elapsed
## 0.148 0.136 0.106
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))
## user system elapsed
## 0.242 0.061 0.052
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))
## user system elapsed
## 0.097 0.047 0.079
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))
## user system elapsed
## 0.193 0.076 0.040
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))
## user system elapsed
## 0.162 0.083 0.041
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))
## user system elapsed
## 0.098 0.065 0.037
mcparallel
Asynchronous evaluation of an R expression in a separate process
str() of the objects returned by mcparallel:
List of 2
$ pid: int 44778
$ fd : int [1:2] 4 7
- attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
List of 2
$ pid: int 44779
$ fd : int [1:2] 5 9
- attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
mccollect
Checks mcparallel objects for completion and collects their results
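A minimal round trip through both functions (forked processes, so this does not work on Windows):

```r
library(parallel)

# launch two jobs asynchronously; each returns immediately with a job object
m1 = mcparallel(rnorm(1e6))
m2 = mcparallel(rbeta(1e6, 1, 1))

str(m1)  # a "parallelJob" object holding the child's pid and file descriptors

# block until both jobs finish and gather their results as a list
res = mccollect(list(m1, m2))
str(res, max.level = 1)
```

Passing `wait = FALSE` to mccollect instead polls the jobs and returns NULL for any that have not finished yet.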
doMC & foreach
Packages by Revolution Analytics that provide the foreach function, which is a parallelizable for loop (and then some).
Core functions:
registerDoMC
foreach, %dopar%, %do%
registerDoMC
Primarily used to set the number of cores used by foreach; by default uses options("cores") or half the number of cores found by detectCores from the parallel package.
foreach
A slightly more powerful version of base for loops (think for with an lapply flavor). Combine with %do% or %dopar% for single core or multicore execution.
foreach - iterators
foreach can iterate across more than one value, but it doesn't do length coercion.
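For example, with two iterators of unequal length foreach stops at the shorter one rather than recycling:

```r
library(foreach)

# iterates pairwise over i and j; the extra value 40 in j is never used
foreach(i = 1:3, j = c(10, 20, 30, 40)) %do% (i + j)
# a list of length 3: 11, 22, 33
```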
foreach - combining results
foreach - parallelization
Swapping out %do% for %dopar% will use the parallel backend.
## user system elapsed
## 0.276 0.024 0.093
## user system elapsed
## 0.283 0.032 0.065
## user system elapsed
## 0.309 0.045 0.054
system.time( purrr::map(c(2,2,2), Sys.sleep) )
## user system elapsed
## 0.032 0.024 6.004
system.time( furrr::future_map(c(2,2,2), Sys.sleep) )
## user system elapsed
## 0.110 0.028 6.066
future::plan(future::multisession) # See also future::multicore
system.time( furrr::future_map(c(2,2,2), Sys.sleep) )
## user system elapsed
## 0.075 0.010 2.395
Bootstrapping
Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking samples of size n (with replacement) from the original data, and using each sample to repeat an analysis procedure of interest. Below is an example of fitting a local regression (loess) to some synthetic data; we will construct a bootstrap prediction interval for this model.
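The described example might be sketched as follows; the synthetic data, grid, and number of resamples are assumptions, not the lecture's exact code:

```r
library(tidyverse)

set.seed(1)
n = 250
d = tibble(
  x = runif(n, 0, 2 * pi),
  y = sin(x) + rnorm(n, sd = 0.3)
)

# fixed prediction grid inside the observed range of x
grid = tibble(x = seq(min(d$x), max(d$x), length.out = 100))

# refit the loess model to one bootstrap resample and predict on the grid
boot_pred = function(i) {
  d_boot = slice_sample(d, n = n, replace = TRUE)
  predict(loess(y ~ x, data = d_boot), newdata = grid)
}

# serial here; swapping map for furrr::future_map parallelizes the resamples
preds = map(1:1000, boot_pred)
pred_mat = do.call(rbind, preds)

# pointwise 95% bootstrap interval for the fitted curve
ci = apply(pred_mat, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
```

Since each resample is fit independently, this is exactly the kind of embarrassingly parallel workload the tools above are built for.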
Optimal use of multiple cores is hard, there isn’t one best solution
Don’t underestimate the overhead cost
Experimentation is key
Measure it or it didn’t happen
Be aware of the trade off between developer time and run time
An awful lot of statistics is at its core linear algebra.
For example:
\[ \hat{\beta} = (X^T X)^{-1} X^Ty \]
Principal component analysis
Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).
Often solved via SVD - Let \(X = U\Sigma W^T\) then \(T = U\Sigma\).
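Both formulas can be checked numerically in a few lines (the data and coefficient values here are made up for illustration):

```r
set.seed(1)

# least squares: beta-hat = (X^T X)^{-1} X^T y
n = 100
X = cbind(1, matrix(rnorm(n * 2), n, 2))   # design matrix with intercept
y = X %*% c(1, 2, -0.5) + rnorm(n)

# solve the normal equations via a linear solve, not an explicit inverse
beta_hat = solve(crossprod(X), crossprod(X, y))
all.equal(drop(beta_hat), unname(coef(lm(y ~ X - 1))))   # matches lm's QR fit

# PCA: T = XW with W the eigenvectors of X^T X, or T = U Sigma from the SVD
Z = scale(matrix(rnorm(100 * 3), 100, 3), center = TRUE, scale = FALSE)
W = eigen(crossprod(Z), symmetric = TRUE)$vectors
s = svd(Z)

# the two score matrices agree up to column signs
all.equal(abs(Z %*% W), abs(s$u %*% diag(s$d)), check.attributes = FALSE)
```

Using solve(A, b) rather than solve(A) %*% b matters for exactly the efficiency and stability reasons discussed next.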
Not unique to statistics, these are the types of problems that come up across all areas of numerical computing.
Numerical linear algebra \(\ne\) mathematical linear algebra
Efficiency and stability of numerical algorithms matter
Don’t reinvent the wheel - common core linear algebra tools (well defined API)
Low level algorithms for common linear algebra operations
BLAS
Basic Linear Algebra Subprograms
Copying, scaling, multiplying vectors and matrices
Origins go back to 1979, written in Fortran
LAPACK
Linear Algebra Package
Higher level functionality building on BLAS.
Linear solvers, eigenvalues, and matrix decompositions
Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated
Written in Fortran and designed for a single CPU core
Certain (potentially non-optimal) hard coded defaults (e.g. block size)
Multithreaded alternatives:
ATLAS - Automatically Tuned Linear Algebra Software
OpenBLAS - fork of GotoBLAS from TACC at UTexas
Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools
cuBLAS / Magma - GPU libraries from Nvidia and UTK respectively
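You can check which BLAS and LAPACK your R session is actually linked against (field names as in recent R releases):

```r
# paths of the BLAS / LAPACK shared libraries this session is using
si = sessionInfo()
si$BLAS
si$LAPACK

# version string of the LAPACK in use
La_version()
```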
| n | 1 core (s) | 2 cores (s) | 4 cores (s) | 8 cores (s) |
|---|---|---|---|---|
| 100 | 0.001 | 0.001 | 0.000 | 0.000 |
| 500 | 0.018 | 0.011 | 0.008 | 0.008 |
| 1000 | 0.128 | 0.068 | 0.041 | 0.036 |
| 2000 | 0.930 | 0.491 | 0.276 | 0.162 |
| 3000 | 3.112 | 1.604 | 0.897 | 0.489 |
| 4000 | 7.330 | 3.732 | 1.973 | 1.188 |
| 5000 | 14.223 | 7.341 | 3.856 | 2.310 |
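A benchmark in this spirit can be run as below (the exact operation behind the table isn't shown here; a dense matrix multiply is a typical choice):

```r
# time a dense n x n matrix multiply; with a multithreaded BLAS the core
# count is controlled by the library (e.g. the OPENBLAS_NUM_THREADS or
# OMP_NUM_THREADS environment variables), not by R itself
mm_time = function(n) {
  X = matrix(rnorm(n^2), n, n)
  system.time(X %*% X)[["elapsed"]]
}

sapply(c(100, 500, 1000), mm_time)
```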
Sta 523 - Fall 2022